Abstract

An algorithm for generating tree-structured neural networks using a soft-competitive recursive
partitioning rule is described. It is demonstrated that this algorithm grows robust, honest
estimators. Preliminary results on a 10-class, 240-dimensional OCR classification task are
presented which show that the tree outperforms backpropagation. Arguments are made which
suggest why this should be the case. The connection of the soft-competitive splitting rule to
the twoing rule is described.

1  Introduction 

In even the simplest cases, gradient descent algorithms such as backpropagation are prone to
sub-optimal behavior due to spurious local minima (Sontag and Sussmann, 1988). This problem is
related  to  the  strong  interference  effects  that  a  single  backprop  net  will  experience  when  it  is 
trained  to  perform  many  different  sub-tasks  (Jacobs  et  al.,  1991).  This  interference  gives  rise  to 
spurious  local  minima  which  impair  the  net’s  learning  and  generalization.  It  is  therefore  desirable 
to  sub-divide  complex  tasks  into  sub-tasks  which  are  simpler,  since  the  networks  needed  to  process 
these  sub-tasks  will  necessarily  be  less  complex  than  a  general  purpose  net  and  will  therefore  be 
faced  with  fewer  local  minima. 

This notion of sub-division was first implemented successfully in the CART algorithm (Breiman
et al., 1984). This method implemented very simple splits but was successful due to the
introduction of pruning, which prevented the solutions from becoming biased to the training data.
If one knows of an a priori sub-task breakdown, sub-nets can be constructed by hand (Hampshire and
Waibel, 1989). Otherwise, the sub-tasks must be extracted from the data. Identifying sub-tasks
often requires the extraction of new features, which has been done in an unsupervised manner
(Intrator, 1991). Sub-tasks can also be identified as part of the learning process (Reilly et al.,
1987; Jacobs et al., 1990; Jacobs et al., 1991; Sanger, 1991). It is also possible to construct a
CART-like tree using perceptrons to perform the partitioning (Sankar and Mammone, 1991).

Research was supported by the National Science Foundation, the Army Research Office, and the Office of Naval
Research.


In this paper we present an algorithm which, unlike the algorithms discussed above, utilizes a
modified version of the original CART twoing rule for splitting (Perrone, 1991). This new splitting
rule is a soft-competitive (Hinton and Nowlan, 1990) neural net analog of the twoing rule which
constructs splits based on a minimization of the interference between sub-tasks.

We use the CART methodology of top-down/bottom-up pruning when growing our network to
ensure that we grow right-sized trees which are honest estimators for the underlying probability
distributions. This helps prevent the tree architecture from becoming biased to the training data.
The  additional  problem  of  over-fitting  at  the  sub-network  level  is  avoided  by  using  cross-validation 
as  a  stopping  criterion  for  the  training  of  the  tree  node  networks. 

The soft-competition splitting rule is used to decide how to divide tasks into sub-tasks. The
soft-competition in the splitting rule allows us to easily determine which sub-tasks are most distinct,
and the rule therefore attempts to maximize the reduction in interference with each successive split.

In  section  2,  the  soft-competition  splitting  rule  is  explained.  In  section  3,  examples  are  given 
which  demonstrate  the  algorithm’s  ability  to  avoid  local  minima  by  smart  choice  of  sub-tasks;  and 
preliminary  results  on  a  real-world  character  recognition  task  are  presented.  Section  4  contains  a 
summary  and  conclusions. 


2  The  Soft-Competition  Splitting  Rule 

Let $p_l^i$ be the $l$th pattern from class $i$ at a given tree node. Let $f_j(p_l^i)$ be the value of the $j$th
sigmoidal output unit of a backprop net for the $l$th pattern from the $i$th class. Define the confusion
matrix, $m_{ij}$, as follows:

$$m_{ij} = \sum_l f_j(p_l^i)$$

The  confusion  matrix  is  thus  measuring  the  overall  signal  of  the  ith  class  in  the  jth  output  of 
the  backprop  net.  Note  that  if  the  net  were  a  perfect  classifier,  then  the  confusion  matrix  would  be 
the  identity  matrix.  Thus,  the  off-diagonal  terms  can  be  thought  of  as  a  measure  of  the  confusion 
between  classes. 
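
As a minimal sketch of this definition, the following Python computes a confusion matrix from a
list of output vectors and their true class labels. The function name, the toy outputs, and the
division by the number of patterns per class are assumptions made here (the division is what
makes a perfect classifier yield the identity matrix); the outputs merely stand in for those of a
trained backprop net.

```python
def confusion_matrix(outputs, labels, num_classes):
    """Return m where m[i][j] is the mean of output unit j over patterns of class i.

    outputs: list of per-pattern output vectors (sigmoidal values in [0, 1]).
    labels:  list of true class indices, one per pattern.
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for out, i in zip(outputs, labels):
        counts[i] += 1
        for j in range(num_classes):
            sums[i][j] += out[j]
    # Assumption: normalize each row by the class size so rows are comparable.
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(num_classes)
    ]


# Toy outputs: classes 0 and 1 leak into each other, class 2 is well separated.
outputs = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]]
labels = [0, 0, 1, 2]
m = confusion_matrix(outputs, labels, 3)
for row in m:
    print([round(v, 3) for v in row])
```

In this toy case the off-diagonal entries between classes 0 and 1 are the largest, which is the
kind of confusion the splitting rule is meant to detect.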

A class partition is defined as a grouping of the labels of the classes present at a given tree
node into two distinct subgroups of labels, $\alpha$ and $\beta$. For a given class partition, we can define a
confusion measure, $M(\alpha, \beta)$, as the amount of confusion existing between the two groups in the
class partition, thus:

$$M(\alpha, \beta) =$$

where $N_\alpha$ and $N_\beta$ are the number of classes in $\alpha$ and $\beta$, respectively; $m_{ij}$ is an element from
the confusion matrix; and $\alpha$ and $\beta$ are the subgroups of the partition.
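
One way to make such a measure concrete is sketched below. The specific form used, the sum of
the cross-group entries of the confusion matrix in both directions divided by the product of the
two group sizes, is an assumption chosen only because it uses the quantities named above; it
should not be read as the exact expression intended here. The function name and the toy matrix
are likewise hypothetical.

```python
def confusion_measure(m, alpha, beta):
    """Mean cross-group confusion between class groups alpha and beta.

    Assumed form (illustrative only): sum m[i][j] + m[j][i] over all
    i in alpha and j in beta, divided by len(alpha) * len(beta).
    """
    total = sum(m[i][j] + m[j][i] for i in alpha for j in beta)
    return total / (len(alpha) * len(beta))


# Toy confusion matrix: classes 0 and 1 are confused, class 2 is distinct.
m = [
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9],
]
together = confusion_measure(m, [0, 1], [2])   # confusable classes kept together
separated = confusion_measure(m, [0], [1, 2])  # confusable classes split apart
print(together, separated)
```

Under this assumed form, keeping the two confusable classes in the same group gives the smaller
value, since only the weak confusion with class 2 crosses the partition.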

Now, $\alpha$ and $\beta$ can be thought of as sub-tasks, so the confusion measure is a measure of
the interference between the two sub-tasks. This gives us a very convenient way of choosing
sub-tasks: we simply minimize the confusion measure over all of the partitions. This can be done with
an exhaustive search, or, if the number of classes is too large, one can use a simulated annealing
algorithm on the confusion measure (Press et al., 1987).
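
The exhaustive search can be sketched as follows. The confusion measure used here is an
assumed, illustrative form (mean of the cross-group confusion matrix entries in both directions),
and the function names and toy matrix are hypothetical. Keeping the first class in the first group
ensures that each unordered partition is visited exactly once.

```python
from itertools import combinations


def confusion_measure(m, alpha, beta):
    # Assumed form: mean of m[i][j] + m[j][i] over cross-group class pairs.
    total = sum(m[i][j] + m[j][i] for i in alpha for j in beta)
    return total / (len(alpha) * len(beta))


def best_partition(m, classes):
    """Return (score, alpha, beta) minimizing the confusion measure.

    Requires at least two classes. There are 2**(k - 1) - 1 candidate
    partitions for k classes.
    """
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    best = None
    # r = number of extra classes joining `first`; r < len(rest) keeps beta non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            alpha = [first, *extra]
            beta = [c for c in rest if c not in extra]
            score = confusion_measure(m, alpha, beta)
            if best is None or score < best[0]:
                best = (score, alpha, beta)
    return best


m = [
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9],
]
score, alpha, beta = best_partition(m, [0, 1, 2])
print(alpha, beta)  # alpha = [0, 1], beta = [2] for this matrix
```

For a 10-class problem this search visits 2^9 - 1 = 511 partitions, so exhaustive enumeration is
cheap at that size; the count doubles with each additional class.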


In  practice,  one  would  train  a  backprop  net  on  the  full  problem  at  a  particular  tree  node,  use 
this  net  to  generate  a  confusion  matrix,  use  the  confusion  matrix  to  generate  a  class  partition  and 
then  train  a  new  net  to  perform  the  partitioning  to  the  children  nodes  of  the  tree. 
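
The recursive structure of this procedure can be sketched in a few lines. Everything named here
is hypothetical: `confusion_for` is a stand-in callback for "train a net on the classes at this
node and build its confusion matrix", the confusion measure is an assumed illustrative form, and
the recursion simply stops at single-class leaves. Training the splitting nets and pruning are
omitted.

```python
from itertools import combinations


def cross_confusion(m, alpha, beta):
    # Assumed measure: mean of m[i][j] + m[j][i] over cross-group class pairs.
    total = sum(m[i][j] + m[j][i] for i in alpha for j in beta)
    return total / (len(alpha) * len(beta))


def best_partition(m, classes):
    """Exhaustively pick the two-group partition with the least confusion."""
    first, rest = classes[0], classes[1:]
    candidates = []
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            alpha = [first, *extra]
            beta = [c for c in rest if c not in extra]
            candidates.append((cross_confusion(m, alpha, beta), alpha, beta))
    _, alpha, beta = min(candidates, key=lambda c: c[0])
    return alpha, beta


def grow_tree(classes, confusion_for):
    """Return a nested tuple of class labels describing the tree."""
    if len(classes) == 1:
        return classes[0]              # leaf: a single class remains
    m = confusion_for(classes)         # stand-in for training a net at this node
    alpha, beta = best_partition(m, classes)
    return (grow_tree(alpha, confusion_for), grow_tree(beta, confusion_for))


# A fixed matrix replaces per-node training in this toy run.
FIXED = [[0.8, 0.2, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]]
tree = grow_tree([0, 1, 2], lambda classes: FIXED)
print(tree)  # ((0, 1), 2)
```

The well-separated class is split off at the root, and the two confusable classes are left to a
dedicated node further down the tree.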

The importance of the soft-competition that is inherent in the definition of the confusion matrix
is the following. Consider the situation in which the outputs for two classes are very similar but the
output for the correct class is usually greater than the other, and consider a third class output which
is usually much less than both. If we use hard competition by setting the largest output to one and
all of the others to zero, we throw away the information that tells us that the relationships among
the three classes are not the same.
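
A small numerical illustration of this point follows. The output values are invented for the
example, and the helper names are hypothetical; the averaging over patterns mirrors one row of a
confusion matrix.

```python
def winner_take_all(out):
    """Hard competition: the largest output becomes 1, all others 0."""
    k = max(range(len(out)), key=out.__getitem__)
    return [1.0 if j == k else 0.0 for j in range(len(out))]


def mean_rows(rows):
    """Average a list of equal-length vectors component by component."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


# Invented outputs for three patterns of class 0: unit 1 is a close
# runner-up each time, unit 2 is always far behind.
class0_outputs = [[0.55, 0.45, 0.05], [0.60, 0.50, 0.02], [0.52, 0.48, 0.04]]

soft = mean_rows(class0_outputs)
hard = mean_rows([winner_take_all(o) for o in class0_outputs])

print([round(v, 3) for v in soft])  # unit 1 clearly larger than unit 2
print(hard)                         # [1.0, 0.0, 0.0]
```

With hard competition the entries for the second and third classes are both zero, so the two
classes look equally unrelated to the first. The soft averages retain the fact that the second
class is nearly as active as the correct one while the third is not.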

It is also interesting to note that the splitting rule above can be thought of as a neural net
implementation of the twoing rule described by Breiman et al. (1984).
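
For reference, the twoing criterion as it is commonly stated for CART scores a candidate split
$s$ of a node $t$ as shown below. The notation is the usual CART one rather than this paper's:
$p_L$ and $p_R$ are the fractions of the node's cases sent to the left and right children $t_L$
and $t_R$, and $p(j \mid t)$ is the proportion of class $j$ in node $t$.

```latex
\Phi(s, t) = \frac{p_L \, p_R}{4}
  \left[ \sum_{j} \bigl| \, p(j \mid t_L) - p(j \mid t_R) \, \bigr| \right]^{2}
```

Maximizing this quantity favors splits whose two children have very different class
distributions, which is the same goal that minimizing the confusion between the two class groups
pursues.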

3  Partitioning  Examples 

In this section, we present two toy classification tasks and one real-world classification task. We
discuss why backprop has difficulty and we show how the tree-structured algorithm avoids these
problems. It should be noted that we are not claiming that backprop cannot solve these problems,
but rather that the tree algorithm solves them more easily. This is a characteristic that the tree
algorithm maintains in high-dimensional problems where spurious minima really start to affect the
performance of backprop.


Figure  1:  Two,  two-dimensional  deterministic  classification 
problems.  Classes  are  labelled  1,  2  and  3.  Sample  points  are 
uniformly  distributed  within  class  regions. 


The classification tasks we will consider are depicted in figure 1. Note that in principle the
minimal size net needed to solve the first task has a single hidden layer of four hidden units, while
the second task requires two hidden layers of two hidden units each. In practice, however, the
gradient descent of backpropagation will frequently leave the networks in the sub-optimal local
minima depicted in figure 2, since the local minima are broad and the global minima are narrow.

The tree network constructed by the algorithm in this paper had no difficulty finding the global
minimum every time (see figure 3), since the splitting nets used by the tree algorithm were always
less complex than the single backprop nets. In the first case, a minimum backprop solution requires
a hidden layer with four hidden units, while the tree solved the problem with perceptrons. In the
second case, the minimum backprop architecture requires two hidden layers of two hidden units
each, while the tree solved the problem with two hidden units in a single hidden layer. Thus the
tree was less prone to spurious local minima.

Figure 2: Local minima solutions which backprop finds. Note
that in each case backprop is not performing optimally.


Figure 3: Optimal solutions found by tree-structured backprop
network.


The tree algorithm was also tested on a large real-world pattern classification problem. The
digits '0' through '9' were taken from the NIST OCR database and were used as a classification
task for both backprop and the tree algorithm described in this paper. The digits were
hand-labelled and preprocessed into 240-dimensional feature vectors. Various backprop architectures
were trained using cross-validation and an independent testing set. The best backprop performance
was 93.8% for a 240-10-10 network, while the best tree performance was 95.4% for a tree using a
240-4-2 splitting network. Testing on this classification problem is continuing. More results will be
presented at the NIPS-91 conference.
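
To put the two accuracy figures on a common footing, the short calculation below converts them
to error rates and computes the relative reduction in error. It uses only the two percentages
reported above; the variable names are illustrative, and because the size of the test set is not
stated, nothing is implied about statistical significance.

```python
backprop_acc = 93.8  # best backprop accuracy (%), 240-10-10 network
tree_acc = 95.4      # best tree accuracy (%), 240-4-2 splitting network

backprop_err = 100.0 - backprop_acc  # about 6.2 %
tree_err = 100.0 - tree_acc          # about 4.6 %

# Fraction of backprop's errors that the tree eliminates.
relative_reduction = (backprop_err - tree_err) / backprop_err

print(f"backprop error: {backprop_err:.1f}%")
print(f"tree error:     {tree_err:.1f}%")
print(f"relative error reduction: {relative_reduction:.1%}")
```

The 1.6 percentage point gain in accuracy thus corresponds to roughly a quarter fewer
misclassified digits.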

4  Summary  and  Conclusions 

In this paper, we have seen that the CART tree-growing methodology combined with a
soft-competition splitting rule can generate robust tree-structured neural networks by identifying
sub-tasks which have a minimum of confusion between them. Smart partition choices lead to less
complex nets and to less of an impact from spurious minima. We have also seen preliminary results
indicating that the adaptive partitioning rule proposed in this paper can reduce the interference
problem in a real-world classification task.